Information retrieval models

• Documents and queries are characterized by a 
number of index terms 

- Based on a query (representation of an information 
problem), guess the relevance of each document 

- Rank documents in the order of relevance 

- Return the most relevant documents 

• The effectiveness of an IR system depends on the
ability of the document representation to capture 
the "meaning" of the documents with respect to the 
users' needs 

Query methods

• Browsing 

• Ad hoc retrieval

- Document collection remains stable, users try to find
relevant documents using ad hoc queries

• Filtering 

- User queries remain stable as "profiles" 

- As new documents are added they are sent to users who 
might be interested in these documents 

- Profiles can be constructed from keyword queries or from
terms occurring in documents retrieved by users

Information retrieval model

• An information retrieval model is a quadruple
$\langle D, Q, F, R(q_i, d_j) \rangle$ where

- D is a set composed of logical views (or representations) 
for the documents in the collection 

- Q is a set composed of logical views (or representations) 
for the user information needs called "queries" 

- F is a framework for modeling document representations, 
queries and their relationships 

- $R(q_i, d_j)$ is a ranking function which associates a real
number with a query $q_i$ in Q and a document
representation $d_j$ in D.

Documents

• A document is a collection of words 

• An index term is an "important" word that

- Possesses a meaning (typically a noun) and has been
simplified (stop-word removal, stemming)

- Distinguishes the document from the others 

• The set of all index terms for a document collection
is given by $\{k_1, \ldots, k_t\}$

• A document $d_j$ in IR is usually given by a vector:

$d_j = \langle w_{1j}, \ldots, w_{tj} \rangle$ where $w_{ij}$ is the weight of
term $k_i$ in document $d_j$.

Documents

• Assumption: 

- The occurrence of a term $t_1$ in a document is completely
independent of the occurrence of another term $t_2$ in the
same document

- Not true in general, but does not appear to have a big 
impact on the retrieval effectiveness 

Boolean model for retrieval

• A Boolean query contains query terms connected
by the logical connectives and, or, not.

• A Boolean query is interpreted as a set 
membership function. 

• Query: 

- Q = "UFO" returns documents that contain the word "UFO"

- Q = "UFO Sightings" AND "Albany" returns documents
that contain the phrase "UFO Sightings" and the word
"Albany"

Boolean model for retrieval

• Q = $k_a$ and ($k_b$ or not $k_c$) returns documents

- that contain the word $k_a$ and

- either contain $k_b$ or do not contain $k_c$

• In the boolean model, each document either 

- satisfies the query, then we return 1 (relevant) 

- does not satisfy the query, then we return 0 (irrelevant) 

• Documents can be represented as a vector of 0s
and 1s

- 1 if a term appears and 0 if it does not appear 
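
A minimal sketch of the set-membership interpretation above. The three toy documents and the function name `matches` are illustrative assumptions, not part of the original slides; each document is stored as the set of terms it contains, which is equivalent to a 0/1 vector over the vocabulary.

```python
# Toy collection (hypothetical): each document is the set of index
# terms it contains, i.e. the positions that would hold a 1.
docs = {
    "d1": {"ufo", "sightings", "albany"},
    "d2": {"ufo", "weather"},
    "d3": {"albany", "weather"},
}


def matches(terms, ka, kb, kc):
    """Evaluate Q = ka AND (kb OR NOT kc) for one document.

    Returns 1 if the document satisfies the query, 0 otherwise.
    """
    return int(ka in terms and (kb in terms or kc not in terms))


# Evaluate the query "ufo AND (albany OR NOT weather)" on every document.
results = {name: matches(terms, "ufo", "albany", "weather")
           for name, terms in docs.items()}
```

Only documents with result 1 are returned; the Boolean model provides no ordering among them.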

Vector model

• In the vector model, both queries and documents are 
weighted vectors 

• The relevance of a document to a query is given by the 
"cosine of the angle" between a document vector and a 
query vector 

$Sim(d_j, q) = \frac{\sum_{i=1}^{t} w_{ij} \, w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^2 \cdot \sum_{i=1}^{t} w_{iq}^2}}$

$\cos(\theta) = Sim(d_j, q)$
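
The cosine measure can be sketched in a few lines. The function name `cosine_similarity` and the example vectors are assumptions made for illustration; both vectors are taken to hold weights over the same t index terms in the same order.

```python
import math


def cosine_similarity(d, q):
    """Cosine of the angle between document vector d and query vector q."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm = math.sqrt(sum(w * w for w in d) * sum(w * w for w in q))
    # An all-zero vector has no direction; treat its similarity as 0.
    return dot / norm if norm else 0.0


d_j = [1.0, 2.0, 0.0]   # hypothetical document weights
q = [2.0, 4.0, 0.0]     # hypothetical query weights, same direction as d_j
score = cosine_similarity(d_j, q)
```

Because the measure depends only on the angle, scaling a vector does not change the score, so a long document is not favored merely for its length.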

Vector model

• The importance of a term in a document depends on: 

- How important it is for identifying the content of this document (term
frequency)

$f_{ij} = freq_{ij} / \max_l freq_{lj}$

the frequency of term $k_i$ in document $d_j$ versus
the highest frequency of any term in the
same document

- How important it is for distinguishing the document from the others
(inverse document frequency)

$idf_i = \log(N / n_i)$

the total number of documents versus
the number of documents containing this
term

The term weight is given by $w_{ij} = f_{ij} \times idf_i$
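
A rough sketch of this weighting scheme, assuming documents arrive as lists of already-processed index terms. The function name `tf_idf_weights` is hypothetical, and the natural logarithm is an assumption since the slides do not state a base.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """Return one {term: weight} dict per document.

    weight = (freq / max freq in the document) * log(N / n_term)
    """
    N = len(docs)
    counts = [Counter(doc) for doc in docs]
    # n[t] = number of documents that contain term t at least once.
    n = Counter(term for c in counts for term in c)
    weights = []
    for c in counts:
        max_freq = max(c.values(), default=1)
        weights.append({t: (f / max_freq) * math.log(N / n[t])
                        for t, f in c.items()})
    return weights


docs = [["ufo", "ufo", "albany"], ["albany", "weather"]]
weights = tf_idf_weights(docs)
```

In this toy collection "albany" occurs in every document, so its weight is zero: a term that appears everywhere does nothing to tell documents apart.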

Vector model

• A user query consists of a number of terms 

• How do we assign weights to query terms: 

$w_{iq} = \left(0.5 + 0.5 \cdot \frac{freq_{iq}}{\max_l freq_{lq}}\right) \cdot \log(N / n_i)$
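
The query weighting can be sketched as follows. The names `query_weights` and `doc_freq` are assumptions for illustration, as is the natural logarithm; `doc_freq` maps each term to the number of documents containing it, and query terms absent from the collection are skipped here to avoid dividing by zero.

```python
import math
from collections import Counter


def query_weights(query_terms, N, doc_freq):
    """Weight each query term by (0.5 + 0.5 * tf) * log(N / n_term)."""
    counts = Counter(query_terms)
    max_freq = max(counts.values(), default=1)
    return {
        t: (0.5 + 0.5 * f / max_freq) * math.log(N / doc_freq[t])
        for t, f in counts.items()
        if doc_freq.get(t)  # skip terms that occur in no document
    }


w_q = query_weights(["ufo", "ufo", "albany"], N=4,
                    doc_freq={"ufo": 1, "albany": 2})
```

The 0.5 offset keeps every query term at no less than half of its idf value, so a term mentioned once in a short query is not drowned out by a repeated one.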

Fuzzy set model

• A fuzzy set A has a membership function, $\mu_A(u)$, that returns a
real number $0 \le \mu_A(u) \le 1$.

- If $\mu_A(u) = 1$, then u is definitely a member of A

- If $\mu_A(u) = 0$, then u is definitely not a member of A

• Fuzzy sets use a number of pre-set functions to determine 
the meaning of various connectives 

- $\mu_{not\ A}(u) = 1 - \mu_A(u)$

- $\mu_{A\ or\ B}(u) = \max\{\mu_A(u), \mu_B(u)\}$ or $\mu_A(u) + \mu_B(u)$

- $\mu_{A\ and\ B}(u) = \min\{\mu_A(u), \mu_B(u)\}$ or $\mu_A(u) \cdot \mu_B(u)$

Fuzzy set model

• Determine the term-to-term correlation in a collection of
documents between terms $k_i$ and $k_l$

$c_{il} = n_{il} / (n_i + n_l - n_{il})$ where $n_x$ is the number of
documents containing term $k_x$ and $n_{il}$ is the number
containing both $k_i$ and $k_l$

Then, compute $\mu_{ij} = 1 - \prod_{k_l \in d_j} (1 - c_{il})$,

the degree of membership of document $d_j$ to term $k_i$
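
The two steps above can be sketched as follows, assuming each document is represented as a set of index terms. The function names `term_correlation` and `membership` and the toy collection are illustrative assumptions.

```python
def term_correlation(docs, ki, kl):
    """Correlation of ki and kl: docs with both / docs with either."""
    n_i = sum(1 for d in docs if ki in d)
    n_l = sum(1 for d in docs if kl in d)
    n_il = sum(1 for d in docs if ki in d and kl in d)
    denom = n_i + n_l - n_il
    return n_il / denom if denom else 0.0


def membership(docs, doc, ki):
    """Degree of membership of `doc` in the fuzzy set of term ki."""
    prod = 1.0
    for kl in doc:
        prod *= 1.0 - term_correlation(docs, ki, kl)
    return 1.0 - prod


docs = [{"ufo", "albany"}, {"ufo", "weather"}, {"albany"}]
c = term_correlation(docs, "ufo", "albany")
mu = membership(docs, docs[2], "ufo")
```

A document that contains the term itself always gets membership 1, because the correlation of a term with itself is 1. A document without the term still gets a nonzero membership when it contains correlated terms, as `docs[2]` does here.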

Fuzzy queries

• Given a query $q = k_i$, the similarity of a document $d_j$ to q is given by $\mu_{ij}$

• Given a query $q = k_i$ AND $k_l$, the similarity of a document $d_j$ to query q is
given by combining $\mu_{ij}$ and $\mu_{lj}$ with min or product (or any appropriate operator for AND)

• Similarly for OR (use + or max) 

• Given a complex query such as (A and (not B)) or (C), apply the operators to each connective in turn
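
A small sketch of how such a query could be scored, assuming the min/max forms of the connectives (the product and sum forms would slot in the same way). All function names are hypothetical, and the inputs are a document's membership degrees for A, B and C.

```python
def fuzzy_not(mu):
    """Complement: 1 - membership."""
    return 1.0 - mu


def fuzzy_and(mu_a, mu_b):
    """Conjunction, using the min form."""
    return min(mu_a, mu_b)


def fuzzy_or(mu_a, mu_b):
    """Disjunction, using the max form."""
    return max(mu_a, mu_b)


def complex_query(mu_a, mu_b, mu_c):
    """Score a document against (A and (not B)) or (C)."""
    return fuzzy_or(fuzzy_and(mu_a, fuzzy_not(mu_b)), mu_c)


# Memberships 0.75, 0.5, 0.25 give max(min(0.75, 0.5), 0.25).
score = complex_query(0.75, 0.5, 0.25)
```

The result is a graded score rather than a 0/1 decision, which is what lets the fuzzy model rank documents where the Boolean model cannot.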

Extended Boolean Model

• Suppose you are given a query containing keywords $k_x$ and $k_y$

• Assume the weights of terms $k_x$ and $k_y$ in document $d_j$ are given by $(x_j, y_j)$

• Given query "$k_x$ OR $k_y$", we would like $(x_j, y_j)$
to be as far away from (0,0) as possible,
hence maximize distance((0,0), $(x_j, y_j)$)

• Given query "$k_x$ AND $k_y$", we would like $(x_j, y_j)$
to be as close to (1,1) as possible,
hence maximize 1 - distance((1,1), $(x_j, y_j)$)



[Figure: the OR query in the (x, y) weight plane, with distance measured from (0,0)]

Extended Boolean Model

• Under this model: 

- $Sim(\text{or-query}, d) = \sqrt{(x^2 + y^2)/2}$

- $Sim(\text{and-query}, d) = 1 - \sqrt{((1-x)^2 + (1-y)^2)/2}$

• Suppose now connectives and/or have a degree "p" 

- I.e. or-query: $k_1\ OR^p\ k_2\ OR^p \ldots OR^p\ k_m$

- $sim(\text{or-query}, d) = \left( \frac{x_1^p + x_2^p + \ldots + x_m^p}{m} \right)^{1/p}$

- I.e. and-query: $k_1\ AND^p\ k_2\ AND^p \ldots AND^p\ k_m$

- $sim(\text{and-query}, d) = 1 - \left( \frac{(1-x_1)^p + (1-x_2)^p + \ldots + (1-x_m)^p}{m} \right)^{1/p}$
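
The p-norm similarities can be sketched directly from their definitions. The function names `sim_or` and `sim_and` and the sample weights are assumptions for illustration; the weights are a document's term weights for the m query terms, each in [0, 1], with p >= 1.

```python
def sim_or(weights, p):
    """p-norm similarity for k_1 OR^p ... OR^p k_m."""
    m = len(weights)
    return (sum(x ** p for x in weights) / m) ** (1.0 / p)


def sim_and(weights, p):
    """p-norm similarity for k_1 AND^p ... AND^p k_m."""
    m = len(weights)
    return 1.0 - (sum((1.0 - x) ** p for x in weights) / m) ** (1.0 / p)


w = [0.2, 0.8]  # hypothetical term weights in one document
or_1, and_1 = sim_or(w, 1), sim_and(w, 1)
or_2, and_2 = sim_or(w, 2), sim_and(w, 2)
```

With p = 2 and m = 2 these are exactly the two-term distance formulas; evaluating the functions with other values of p shows how the parameter tunes the strictness of each connective.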

Extended Boolean Model

• Given p-norms, we have the following properties: 

- If p = 1, then $sim(\text{or-query}) = sim(\text{and-query}) = (x_1 + \ldots + x_m)/m$

- Reduces to the arithmetic mean

- If p = $\infty$, then $sim(\text{or-query}) = \max(x_k)$ and $sim(\text{and-query}) = \min(x_k)$